Abstract. We consider the problem of efficiently designing sets (codes) 
of equal-length DNA strings (words) that satisfy certain combinato- 
rial constraints. This problem has numerous motivations including DNA 
computing and DNA self-assembly. Previous work has extended results 
from coding theory to obtain bounds on code size for new biologically 
motivated constraints and has applied heuristic local search and genetic 
algorithm techniques for code design. This paper proposes a natural op- 
timization formulation of the DNA code design problem in which the 
goal is to design n strings that satisfy a given set of constraints while 
minimizing the length of the strings. For multiple sets of constraints, we 
provide high-probability algorithms that run in time polynomial in n and 
any given constraint parameters, and output strings of length within a 
constant factor of the optimal. To the best of our knowledge, this work 
is the first to consider this type of optimization problem in the context 
of DNA code design. 



1 Introduction 

In this paper we study the problem of efficiently designing sets (codes) of DNA 
strings (words) of near optimal length that fulfill certain combinatorial con- 
straints. Many applications have emerged in recent years that depend on the 
scalable design of such words. One such problem is in DNA computing where 
inputs to computational problems are encoded into DNA strands for the purpose 
of computing via DNA complementary binding [1]. Another application involves 
implementing Wang tile self-assembly systems by encoding glues of Wang tiles 
into strands of DNA [17]. DNA words can also be used to store information at 
the molecular level [4], act as molecular bar codes for identifying molecules in 
complex libraries [4, 5, 13], or implement DNA arrays [3]. 

For a set of DNA words to be effective for the above applications, they must 
fulfill a number of combinatorial constraints. Of particular importance is the 
need for specific hybridization between a given word and its unique Watson-Crick
complement. That is, we need to make sure that hybridization does not
occur between a word and the complement of a different word in the set, or
even between any word and any other word in the set. For this requirement, Marathe
et al. [12] have proposed the basic Hamming constraint, the reverse complement
Hamming constraint, and the self-complementary constraint. We further consider the
more restrictive shifting Hamming constraint, which requires a large Hamming
distance between all alignments of any pair of words [6].

Table 1. This table summarizes our results regarding the efficient design of DNA
words. Here $n$ is the number of words; $k$ denotes the maximum of the constraint
parameters for constraints 1 through 6 (see Section 2); and $\ell = \Theta(k + \log n)$ denotes
the optimal achievable word length for the listed word design problems (see Theorems 1,
3, 4 and 6).

We also consider three constraints not related to Hamming distance. The 
consecutive base constraint limits the length of any run of identical bases in any 
given word. Long runs of identical bases are considered to cause hybridization 
errors [6,14]. The GC content constraint requires that a large percentage of 
the bases in any given word are either G or C. This constraint is meant to give 
each string similar thermodynamic properties [14-16]. The free energy constraint 
requires that the difference in free energy of any two words is bounded by a 
small constant. This helps ensure that each word in the set has a similar melting 
temperature [6,12]. 

In addition to the above constraints, it is desirable for the length $\ell$ of each
word to be as small as possible. The motivation for minimizing $\ell$ is evident from
the fact that it is more difficult to synthesize longer strands. Similarly, longer
DNA strands require more DNA to be used for the respective application.

There has been much previous work in the design of DNA words [4, 6, 9-13,
15, 16]. In particular, Marathe et al. [12] have extended results from coding
theory to obtain bounds on code size for various biologically motivated con- 
straints. However, most work in this area has been based on heuristics, genetic 
algorithms, and stochastic local searches that do not provide provably good 
words provably fast. 



In this work we provide algorithms with analytical guarantees for combinatorial
structures and time complexity. In particular, we formulate an optimization
problem that takes as input a desired number of strings $n$ and produces $n$
length-$\ell$ strings that satisfy a specified set of constraints, while at the same time
minimizing the length $\ell$. We restrict our solution to this problem in two ways.
First, we require that our algorithms run in time polynomial in the number
of strings $n$ and in any given constraint parameters. Second, we require that
our algorithms produce sets of words whose length $\ell$ is within a
constant multiple of the optimal achievable word length, while at the same time
fulfilling the respective constraints with high probability. For various subsets of
the constraints we propose, we provide algorithms that do this. We thus provide
fast algorithms for the creation of sets of short words.

Paper Layout: In Section 2, we describe the different biologically motivated 
combinatorial constraints we use. In Section 3 we solve the design problem with 
subsets of constraints including the Hamming constraints, the consecutive bases 
constraint, and the GC content constraint. In Section 4 we extend our algorithms 
to deal with the free energy constraint. 

2 Preliminaries 

2.1 Notations 

Let $X = x_1 x_2 \ldots x_\ell$ be a word where $x_i$ belongs to some alphabet $\Pi$. In this
paper we deal with two alphabets, namely, the binary alphabet $\Pi_B = \{0, 1\}$ and
the DNA alphabet $\Pi_D = \{A, C, G, T\}$. The elements of an alphabet are called
characters. We will use capital letters for words and small letters for characters.
Our goal is to design DNA words but some of our algorithms generate binary
words in intermediate steps.

The reverse of $X$, denoted by $X^R$, is the word $x_\ell x_{\ell-1} \ldots x_1$. The complement
of a character $x$ is denoted by $x^c$. The complements for the binary alphabet are
given by $0^c = 1$, $1^c = 0$, and for the DNA alphabet we have $A^c = T$, $C^c = G$,
$G^c = C$, $T^c = A$.

The complement of a word is obtained by taking the complement of each
of the characters in the word, i.e., $X^c = x_1^c x_2^c \ldots x_\ell^c$. The reverse complement
of $X$ is the complement of $X^R$, $X^{RC} = x_\ell^c x_{\ell-1}^c \ldots x_1^c$. The Hamming distance
$H(X, Y)$ between two words $X$ and $Y$ is the number of positions where $X$ differs
from $Y$.
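The two operations just defined can be illustrated with a minimal Python sketch. The function names `reverse_complement` and `hamming` are chosen here for illustration and do not come from the paper.

```python
# Watson-Crick complement table for the DNA alphabet {A, C, G, T}.
DNA_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(word: str) -> str:
    """Return X^RC: complement every character and reverse the order."""
    return "".join(DNA_COMPLEMENT[ch] for ch in reversed(word))


def hamming(x: str, y: str) -> int:
    """Return H(X, Y): the number of positions where two equal-length words differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs words of equal length")
    return sum(a != b for a, b in zip(x, y))
```

For example, the reverse complement of AACG is CGTT, and applying the operation twice returns the original word.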

We are interested in designing a set $\mathcal{W}$ of $n$ words over $\Pi_D$, each of length $\ell$,
which satisfy the constraints defined in Section 2.2 below.

2.2 Constraints 

The constraints we consider can be classified into two categories: non-interaction
constraints and stability constraints. Non-interaction constraints ensure that
unwanted hybridizations between two DNA strands are avoided, and stability
constraints ensure that the DNA strands are stable in a solution. The first six
constraints below are non-interaction constraints while the remaining three are
stability constraints.



$C_1(k_1)$: Basic Hamming Constraint $(k_1)$ = for any distinct words $Y, X \in \mathcal{W}$,
$H(Y, X) \geq k_1$.

This constraint limits non-specific hybridization between the Watson-Crick
complement of some word $Y$ and a distinct word $X$.

$C_2(k_2)$: Reverse Complementary Constraint $(k_2)$ = for any words $Y, X \in \mathcal{W}$,
$H(Y, X^{RC}) \geq k_2$.

This constraint is intended to limit hybridization between a word and the
reverse of another word.

$C_3(k_3)$: Self Complementary Constraint $(k_3)$ = for any word $Y \in \mathcal{W}$,
$H(Y, Y^{RC}) \geq k_3$.

This constraint prevents a word from hybridizing with itself.

$C_4(k_4)$: Shifting Hamming Constraint $(k_4)$ = for any two words $Y, X \in \mathcal{W}$,

$H(Y[1..i], X[(\ell - i + 1)..\ell]) \geq k_4 - (\ell - i)$ for all $i$.

This is a stronger version of the Basic Hamming Constraint.

$C_5(k_5)$: Shifting Reverse Complementary Constraint $(k_5)$ = for any two
words $Y, X \in \mathcal{W}$,

$H(Y[1..i], X[1..i]^{RC}) \geq k_5 - (\ell - i)$ for all $i$; and

$H(Y[(\ell - i + 1)..\ell], X[(\ell - i + 1)..\ell]^{RC}) \geq k_5 - (\ell - i)$ for all $i$.

This is a stronger version of the Reverse Complementary Constraint.

$C_6(k_6)$: Shifting Self Complementary Constraint $(k_6)$ = for any word $Y \in \mathcal{W}$,

$H(Y[1..i], Y[1..i]^{RC}) \geq k_6 - (\ell - i)$ for all $i$; and

$H(Y[(\ell - i + 1)..\ell], Y[(\ell - i + 1)..\ell]^{RC}) \geq k_6 - (\ell - i)$ for all $i$.

This is a stronger version of the Self Complementary Constraint.
$C_7(\gamma)$: GC Content Constraint $(\gamma)$ = $\gamma$ percentage of bases in any word
$W \in \mathcal{W}$ are either G or C.

The GC content affects the thermodynamic properties of a word [14-16].
Therefore, having the same ratio of GC content for all the words will assure
similar thermodynamic characteristics.

$C_8(d)$: Consecutive Base Constraint $(d)$ = no word has more than $d$ consecutive
identical bases, for $d \geq 2$.

In some applications, consecutive occurrences (also known as runs) of the
same base increase the number of annealing errors.

$C_9(\sigma)$: Free Energy Constraint $(\sigma)$ = for any two words $Y, X \in \mathcal{W}$,
$FE(Y) - FE(X) \leq \sigma$, where $FE(W)$ denotes the free energy of a word defined in
Section 4.

This constraint ensures that all the words in the set have similar melting
temperatures, which allows hybridization of multiple DNA strands to proceed
simultaneously [13].
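Several of the constraints above can be checked directly from their definitions. The following Python sketch is illustrative only: the function names are hypothetical, the reverse complementary check is read literally over all pairs including Y = X, and the checks use the naive quadratic pairwise scan rather than any optimized verification method.

```python
from itertools import combinations, combinations_with_replacement, groupby

COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}


def rc(word):
    """Reverse complement of a DNA word."""
    return "".join(COMP[c] for c in reversed(word))


def hamming(x, y):
    """Hamming distance between two equal-length words."""
    return sum(a != b for a, b in zip(x, y))


def basic_hamming_ok(words, k1):
    """C1: every pair of distinct words differs in at least k1 positions."""
    return all(hamming(x, y) >= k1 for x, y in combinations(words, 2))


def reverse_complement_ok(words, k2):
    """C2: H(Y, X^RC) >= k2 for all pairs (assumption: the pair Y = X is included)."""
    return all(hamming(y, rc(x)) >= k2
               for x, y in combinations_with_replacement(words, 2))


def self_complementary_ok(words, k3):
    """C3: H(Y, Y^RC) >= k3 for every word."""
    return all(hamming(w, rc(w)) >= k3 for w in words)


def gc_fraction(word):
    """Fraction of characters that are G or C (the quantity constrained by C7)."""
    return sum(c in "GC" for c in word) / len(word)


def longest_run(word):
    """Length of the longest run of identical consecutive characters."""
    return max(len(list(group)) for _, group in groupby(word))


def consecutive_base_ok(words, d):
    """C8: no word contains a run of more than d identical bases."""
    return all(longest_run(w) <= d for w in words)
```

Each pairwise check compares all pairs of words position by position, so it costs on the order of $n^2 \ell$ character comparisons, which is more than the time needed to generate the words at random.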



For each of the given constraints above we assign a shorthand boolean function
$C_i(t)$ to denote whether or not a given set of words $\mathcal{W}$ fulfills constraint $C_i$
with respect to parameter $t$. For a given integer $n$, the goal of DNA word design
is to efficiently create a set of $n$ length-$\ell$ words such that a given subset of the
above constraints is satisfied, while trying to minimize $\ell$. That is, for a given
subset of constraints $\{C_{\pi_1}, C_{\pi_2}, \ldots, C_{\pi_r}\} \subseteq \{C_1, C_2, \ldots, C_9\}$, the corresponding
DNA word design (DWD) optimization problem is as follows.

Problem 1 ($DWD_{\pi_1, \pi_2, \ldots, \pi_r}$).
Input: Integers $n, t_1, t_2, \ldots, t_r$.

Output: A set $\mathcal{W}$ of $n$ DNA strings each of the minimum length such that for
all $1 \leq i \leq r$ the constraint $C_{\pi_i}(t_i)$ is satisfied over set $\mathcal{W}$.

For this problem we have the following trivial lower bounds for time complexity
and the word size $\ell$ when any one of the first six constraints is applied.

Theorem 1. Consider a set $\mathcal{W}$ of $n$ DNA words each of length $\ell$.

1. If $\mathcal{W}$ fulfills any one of the constraints $C_1(k)$, $C_2(k)$, $C_3(k)$, $C_4(k)$, $C_5(k)$, and
$C_6(k)$, then $\ell = \Omega(k + \log n)$.

2. The time complexity of producing a set $\mathcal{W}$ that fulfills any one of the
constraints $C_1(k)$, $C_2(k)$, $C_3(k)$, $C_4(k)$, $C_5(k)$, and $C_6(k)$ is $\Omega(nk + n \log n)$.
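One way to see why these bounds hold is the following short counting argument. It is offered as a sketch of a standard justification and is not necessarily the argument the authors had in mind; for the shifting constraints it uses the case $i = \ell$, where the right-hand side of the constraint equals $k$.

```latex
% Each of the six constraints forces some Hamming distance between two
% length-\ell strings to be at least k, and a Hamming distance never
% exceeds the string length:
k \;\le\; H(\cdot,\cdot) \;\le\; \ell .

% A set of n distinct words over a 4-letter alphabet needs enough room:
n \;\le\; 4^{\ell} \quad\Longrightarrow\quad \ell \;\ge\; \log_4 n .

% Combining the two bounds:
\ell \;\ge\; \max\{k, \log_4 n\} \;\ge\; \tfrac{1}{2}\,(k + \log_4 n) \;=\; \Omega(k + \log n).

% Writing down the output alone takes n \ell steps:
T \;\ge\; n\,\ell \;=\; \Omega(nk + n \log n).
```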

The goal of DNA word design is to simultaneously satisfy as many of the
above nine constraints as possible while achieving words within a constant factor
of the optimal length $\ell$ for the given set of constraints. In Section 3 we show how
to accomplish this goal for various subsets of the constraints.

3 Algorithms for DNA Word Design 

In this section we develop randomized algorithms to generate sets of length-$\ell$
DNA words that satisfy certain sets of constraints while keeping $\ell$ within a
constant factor of the optimal value. In particular, we first show how simply generating
a set of $n$ words at a specific length $\ell = \Theta(k + \log n)$ uniformly at random
is sufficient to fulfill constraints 1, 2, 3, 4, 5, and 6 simultaneously with high
probability. We then propose three extensions to this algorithm to fulfill different
subsets of constraints within a constant factor of the optimal word length.
The first extension yields an algorithm for fulfilling the GC content constraint,
while the second yields one for the consecutive base and GC content constraints
at the cost of the shifting constraints. Finally, we extend the basic randomized
algorithm to fulfill the free energy constraint. The first is thus an algorithm for
simultaneously fulfilling constraints 1, 2, 3, 4, 5, 6, and 7, the second simultaneously
fulfills constraints 1, 2, 3, 7, and 8, and the last one fulfills constraints 1,
2, 3, 4, 5, 6 and 9.



Algorithm FastDWD$_{1,2,3,4,5,6}$($n, k_1, k_2, k_3, k_4, k_5, k_6$)

1. Let $k = \max\{k_1, k_2, k_3, k_4, k_5, k_6\}$.

2. Generate a set $\mathcal{W}$ of $n$ words over $\Pi_D$ of length $\ell = 9 \cdot \max\{k, \lceil \log_4 n \rceil\}$ uniformly
at random.

3. Output $\mathcal{W}$.

Fig. 1. A randomized algorithm for generating $n$ DNA strings satisfying constraints
$C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_4(k_4)$, $C_5(k_5)$, and $C_6(k_6)$.

3.1 A Simple Randomized Algorithm 

Problem 2 ($DWD_{1,2,3,4,5,6}$).

Input: Integers $n, k_1, k_2, k_3, k_4, k_5, k_6$.

Output: A set $\mathcal{W}$ of $n$ DNA strings each of the minimum length such that the
constraints $C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_4(k_4)$, $C_5(k_5)$, $C_6(k_6)$ hold.

The next theorem shows that Algorithm FastDWD$_{1,2,3,4,5,6}$($n, k_1, k_2, k_3, k_4,
k_5, k_6$) in Figure 1 yields a polynomial-time solution to the $DWD_{1,2,3,4,5,6}$ problem
with high probability. We only sketch the proof in the interest of space.

Theorem 2. Algorithm FastDWD$_{1,2,3,4,5,6}$ produces a set $\mathcal{W}$ of $n$ DNA words of
optimal length $\Theta(k + \log n)$ in optimal time $O(n \cdot k + n \cdot \log n)$ satisfying constraints
$C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_4(k_4)$, $C_5(k_5)$ and $C_6(k_6)$ with probability of failure
$o(1/(n + 4^k))$, where $k = \max\{k_1, k_2, k_3, k_4, k_5, k_6\}$.

Proof (Sketch). The probability that two random words violate any of the
constraints $C_1(k_1)$, $C_2(k_2)$, $C_4(k_4)$, and $C_5(k_5)$ can be bounded using Chernoff type
bounds. Similarly, we can bound the probability of a random word violating any
of the constraints $C_3(k_3)$ and $C_6(k_6)$.

We can then apply the Boole-Bonferroni Inequalities to yield a bound on the
probability that any pair of words in a set of $n$ random words violates constraints
$C_1(k_1)$, $C_2(k_2)$, $C_4(k_4)$, or $C_5(k_5)$; or that any single word violates constraints
$C_3(k_3)$ or $C_6(k_6)$. □
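The algorithm of Figure 1 is short enough to sketch in full. The Python below is a minimal illustration, assuming the standard library pseudo-random generator is an acceptable stand-in for uniform sampling; the names `ceil_log` and `fast_dwd_1_to_6` are hypothetical.

```python
import random


def ceil_log(n, base):
    """Smallest integer e with base**e >= n, computed without floating point."""
    exponent, power = 0, 1
    while power < n:
        power *= base
        exponent += 1
    return exponent


def fast_dwd_1_to_6(n, ks, rng=None):
    """Sketch of Figure 1: n uniformly random DNA words of length 9 * max(k, ceil(log_4 n)).

    ks holds the six constraint parameters k_1..k_6. The result satisfies the
    constraints only with high probability; it is not verified here.
    """
    rng = rng or random.Random()
    k = max(ks)
    length = 9 * max(k, ceil_log(n, 4))
    return ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(n)]
```

For instance, with $n = 16$ and all parameters equal to 3, the word length is $9 \cdot \max\{3, 2\} = 27$.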

3.2 Incorporating the GC Content Constraint into
FastDWD$_{1,2,3,4,5,6}$

Now we show how to modify Algorithm FastDWD$_{1,2,3,4,5,6}$ so that it produces a
set of words that also satisfies the GC content constraint. That is, we will show
how to solve the following problem.

Problem 3 ($DWD_{1,2,3,4,5,6,7}$).

Input: Integers $n, k_1, k_2, k_3, k_4, k_5, k_6, \gamma$.

Output: A set $\mathcal{W}$ of $n$ DNA strings each of the minimum length such that the
constraints $C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_4(k_4)$, $C_5(k_5)$, $C_6(k_6)$, $C_7(\gamma)$ hold.



Algorithm FastDWD$_{1,2,3,4,5,6,7}$($n, k_1, k_2, k_3, k_4, k_5, k_6, \gamma$)

1. Let $k = \max\{k_1, k_2, k_3, k_4, k_5, k_6\}$.

2. Generate a set $\mathcal{W}$ of $n$ words over the binary alphabet $\Pi_B$ of length $\ell =
10 \cdot \max\{k, \lceil \log_2 n \rceil\}$ uniformly at random.

3. For each word $W \in \mathcal{W}$, for any $\lceil \gamma \cdot \ell \rceil$ characters in $W$, replace 0 by G and 1 by C.
For the remaining characters replace 0 by A and 1 by T to get $W'$. Let $\mathcal{W}'$ be the
set of all words $W'$.

4. Output $\mathcal{W}'$.

Fig. 2. A randomized algorithm for generating $n$ DNA strings satisfying constraints
$C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_4(k_4)$, $C_5(k_5)$, $C_6(k_6)$, and $C_7(\gamma)$.

We modify Algorithm FastDWD$_{1,2,3,4,5,6}$ to get Algorithm FastDWD$_{1,2,3,4,5,6,7}$
shown in Figure 2. The next theorem shows that FastDWD$_{1,2,3,4,5,6,7}$ yields a
polynomial-time solution to $DWD_{1,2,3,4,5,6,7}$ with high probability. We omit the
proof in the interest of space.

Theorem 3. Algorithm FastDWD$_{1,2,3,4,5,6,7}$ produces a set $\mathcal{W}$ of $n$ DNA words
of optimal length $\Theta(k + \log n)$ in optimal time $O(n \cdot k + n \cdot \log n)$ satisfying
constraints $C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_4(k_4)$, $C_5(k_5)$, $C_6(k_6)$, and $C_7(\gamma)$ with
probability of failure $o(1/(n + 2^k))$, where $k = \max\{k_1, k_2, k_3, k_4, k_5, k_6\}$.
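The binary-to-DNA substitution in Step 3 of Figure 2 can be sketched as follows. The figure allows any choice of positions; this sketch assumes the first $\lceil \gamma \ell \rceil$ positions are used for every word, treats $\gamma$ as a fraction between 0 and 1, and uses the hypothetical name `binary_to_dna`.

```python
import math


def binary_to_dna(word, gamma):
    """Map a binary word to a DNA word with ceil(gamma * len(word)) G/C characters.

    Assumption: the G/C positions are the leading ones; any fixed choice of
    positions gives the same G/C count.
    """
    gc_count = math.ceil(gamma * len(word))
    out = []
    for pos, bit in enumerate(word):
        if pos < gc_count:
            out.append("G" if bit == "0" else "C")   # G/C region: 0 -> G, 1 -> C
        else:
            out.append("A" if bit == "0" else "T")   # A/T region: 0 -> A, 1 -> T
    return "".join(out)
```

Because 0 and 1 map to different bases at every position, two binary words that differ at a position still differ there after the substitution, so Hamming distances between words mapped with the same positions are unchanged.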

3.3 Incorporating the Consecutive Bases Constraint into
FastDWD$_{1,2,3,4,5,6,7}$

Now we modify Algorithm FastDWD$_{1,2,3,4,5,6,7}$ so that it produces a set that
satisfies both the GC content constraint and the consecutive base constraint
at the cost of the shifting constraints. That is, we will show how to solve the
following problem.

Problem 4 ($DWD_{1,2,3,7,8}$).
Input: Integers $n, k_1, k_2, k_3, \gamma, d$.

Output: A set $\mathcal{W}$ of $n$ DNA strings each of the minimum length such that the
constraints $C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_7(\gamma)$, $C_8(d)$ hold.

We use Algorithm BreakRuns shown in Figure 3 to break long runs in a
binary word so that it satisfies the consecutive bases constraint with parameter
$d$. Intuitively, for a given word $X$ this algorithm outputs $X'$ by
inserting characters at intervals of $d - 1$ from the left and the right in a manner
such that there are no runs of length greater than $d$. We need to add
characters from both ends to ensure that $H(X, Y^{RC}) \leq H(X', Y'^{RC})$, where $X'$
and $Y'$ are the respective outputs for $X$ and $Y$ from BreakRuns.

We modify Algorithm FastDWD$_{1,2,3,4,5,6,7}$ to get Algorithm FastDWD$_{1,2,3,7,8}$
shown in Figure 3. The next theorem shows that FastDWD$_{1,2,3,7,8}$ yields a
polynomial-time solution to $DWD_{1,2,3,7,8}$ with high probability. We omit the
proof in the interest of space.



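Algorithm BreakRuns($X, d$)

1. Let $X = x_1 x_2 \ldots x_\ell$. For $0 < i \leq \lceil \frac{\ell}{2(d-1)} \rceil - 1$, let $x'_i = x^c_{i(d-1)}$ and $x''_i = x^c_{\ell - i(d-1)}$.
Let $x'_{mid} = x^c_{\lfloor \ell/2 \rfloor}$.

2. Output $X' = x_1 \ldots x_{d-1}\, x'_1\, x_d \ldots x_{\lfloor \ell/2 \rfloor}\, x'_{mid}\, x_{\lfloor \ell/2 \rfloor + 1} \ldots x_{\ell-(d-1)-1}\, x''_1\, x_{\ell-(d-1)} \ldots
x_\ell$.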

Algorithm FastDWD$_{1,2,3,7,8}$($n, k_1, k_2, k_3, \gamma, d$)

1. Let $k = \max\{k_1, k_2, k_3\}$.

2. Generate a set $\mathcal{W}$ of $n$ words over the binary alphabet $\Pi_B$ of length $\ell =
10 \cdot \max\{k, \lceil \log_2 n \rceil\}$ uniformly at random.

3. For each word $W \in \mathcal{W}$, let $W'$ = BreakRuns($W, d$). Let $\mathcal{W}'$ be the set of all words
$W'$.

4. For each word $W' \in \mathcal{W}'$, for any $\lceil \gamma \cdot \ell' \rceil$ characters in $W'$ (where $\ell'$ denotes the length of $W'$), replace 0 by G and 1 by
C. For the remaining characters replace 0 by A and 1 by T to get $W''$. Let $\mathcal{W}''$ be
the set of all words $W''$.

5. Output $\mathcal{W}''$.

Fig. 3. Algorithms for generating $n$ DNA strings satisfying constraints $C_1(k_1)$, $C_2(k_2)$,
$C_3(k_3)$, $C_7(\gamma)$, and $C_8(d)$.

Theorem 4. Algorithm FastDWD$_{1,2,3,7,8}$ produces a set $\mathcal{W}$ of $n$ DNA words of
optimal length $\Theta(k + \log n)$ in optimal time $O(n \cdot k + n \cdot \log n)$ satisfying constraints
$C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_7(\gamma)$, and $C_8(d)$ with probability of failure $o(1/(n + 2^k))$,
where $k = \max\{k_1, k_2, k_3\}$.
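The run-breaking idea can be illustrated with a deliberately simplified Python sketch. It is not the BreakRuns algorithm of Figure 3: it inserts characters from the left only, so it shows why no run can exceed $d$ but makes no attempt to preserve the reverse-complement distance, which is the reason the paper's version works from both ends. The name `break_runs_left_to_right` is hypothetical.

```python
from itertools import groupby


def break_runs_left_to_right(word, d):
    """After every d - 1 characters of a binary word, insert the complement
    of the character just written, so that no run is longer than d."""
    if d < 2:
        raise ValueError("d must be at least 2")
    flip = {"0": "1", "1": "0"}
    out = []
    for pos, bit in enumerate(word, start=1):
        out.append(bit)
        if pos % (d - 1) == 0:
            out.append(flip[bit])   # breaker differs from the preceding character
    return "".join(out)


def longest_run(word):
    """Length of the longest run of identical consecutive characters."""
    return max((len(list(group)) for _, group in groupby(word)), default=0)
```

A run can contain at most one inserted character followed by the next $d - 1$ original characters, which gives the bound $d$. The output is longer than the input by one character per $d - 1$ input characters, so the length grows by at most a constant factor.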

4 Incorporating the Free Energy Constraint into
FastDWD$_{1,2,3,4,5,6}$

Now we give an alternate modification of Algorithm FastDWD$_{1,2,3,4,5,6}$ such that
the free energy constraint is satisfied. The free energy $FE(X)$ of a DNA word
$X = x_1 x_2 \ldots x_\ell$ is approximated by $FE(X) = \text{correction factor} + \sum_{i=1}^{\ell-1} \Gamma_{x_i, x_{i+1}}$,
where $\Gamma_{x,y}$ is the pairwise free energy between base $x$ and base $y$ [7]. For
simplicity we denote the free energy as simply the sum $\sum_{i=1}^{\ell-1} \Gamma_{x_i, x_{i+1}}$ with respect
to a given pairwise energy function $\Gamma$. Let $\Gamma_{\max}$ and $\Gamma_{\min}$ be the maximum and
the minimum entries in $\Gamma$ respectively. Let $D = \Gamma_{\max} - \Gamma_{\min}$.
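The simplified free energy is a sum over adjacent pairs and can be computed in one pass. In the Python sketch below, the table `GAMMA` is a made-up set of small integer energies chosen only so that the example runs; it does not contain measured thermodynamic values, and the correction factor is ignored as in the simplified definition.

```python
from itertools import product

BASES = "ACGT"

# Hypothetical pairwise energies in {1, 2, 3, 4}; NOT real thermodynamic data.
GAMMA = {(a, b): 1 + (BASES.index(a) + 2 * BASES.index(b)) % 4
         for a, b in product(BASES, repeat=2)}

GAMMA_MAX = max(GAMMA.values())
GAMMA_MIN = min(GAMMA.values())
D = GAMMA_MAX - GAMMA_MIN


def free_energy(word, gamma=GAMMA):
    """Sum of the pairwise energies over all adjacent character pairs."""
    return sum(gamma[pair] for pair in zip(word, word[1:]))
```

With this table, $FE$ of a length-$\ell$ word always lies between $(\ell - 1)\Gamma_{\min}$ and $(\ell - 1)\Gamma_{\max}$.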

We now show how to satisfy the free energy constraint $C_9(\sigma)$ for a constant
$\sigma = 4D + \Gamma_{\max}$, while simultaneously satisfying constraints 1, 2, 3, 4, 5, and 6.
That is, we show how to solve the following problem.

Problem 5 ($DWD_{1,2,3,4,5,6,9}$).

Input: Integers $n, k_1, k_2, k_3, k_4, k_5, k_6$.

Output: A set $\mathcal{W}$ of $n$ DNA strings each of the minimum length such that the
constraints $C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_4(k_4)$, $C_5(k_5)$, $C_6(k_6)$, $C_9(4D + \Gamma_{\max})$ hold.

We modify Algorithm FastDWD$_{1,2,3,4,5,6}$ to get Algorithm FastDWD$_{1,2,3,4,5,6,9}$
shown in Figure 4 for solving $DWD_{1,2,3,4,5,6,9}$. The following lemmas identify the
properties of symbols $\Delta$, $\mathcal{W}$, $W_{\max}$, $W_{\min}$, $S_i$, $S_j$, $\alpha$, $\beta$, and $W'_i$ defined in Figure 4 and are
used for proving the correctness of Algorithm FastDWD$_{1,2,3,4,5,6,9}$.

Algorithm FastDWD$_{1,2,3,4,5,6,9}$($n, k_1, k_2, k_3, k_4, k_5, k_6$)

Let $S^1, S^2, \ldots, S^{4^m}$ be all possible sequences of length $m = 2\ell$, where $\ell$ is as defined in
Step 2 below, such that $FE(S^1) \leq FE(S^2) \leq \cdots \leq FE(S^{4^m})$. For two strings $X$ and $Y$
of respective lengths $\ell_X$ and $\ell_Y$ where $\ell_Y$ is even, let $X \otimes Y$ be the string $Y[1..(\ell_Y/2)]\, X\,
Y[(\ell_Y/2 + 1)..\ell_Y]$. Let $\Delta = \max_i \{FE(S^{i+1}) - FE(S^i)\}$.

1. Let $k = \max\{k_1, k_2, k_3, k_4, k_5, k_6\}$.

2. Generate a set $\mathcal{W}$ of $n$ DNA words of length $\ell = 9 \cdot \max\{k, \lceil \log_4 n \rceil\}$ uniformly at
random.

3. Let $W_{\max} = \max_{X \in \mathcal{W}} \{FE(X)\}$ and $W_{\min} = \min_{X \in \mathcal{W}} \{FE(X)\}$.
if $W_{\max} - W_{\min} \leq 3D$, then output $\mathcal{W}$.

else

4. Let $\alpha = W_{\max} + FE(S^1)$ and $\beta = \alpha + \Delta$. For each $S_i \in \mathcal{W}$, find $S_j$ such that $\alpha \leq
FE(S_i) + FE(S_j) \leq \beta$. Let $W'_i = S_i \otimes S_j$.

5. output $\mathcal{W}' = \{W'_1, \ldots, W'_n\}$.

Fig. 4. A randomized algorithm for generating $n$ DNA strings satisfying constraints
$C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_4(k_4)$, $C_5(k_5)$, $C_6(k_6)$, and $C_9(4D + \Gamma_{\max})$.

Lemma 1. $\Delta \leq 2D$.

Lemma 2. If $W_{\max} - W_{\min} > 3D$, then $W_{\max} - W_{\min} + 2D \leq FE(S^{4^m}) - FE(S^1)$.

Lemma 3. For each $S_i \in \mathcal{W}$, there exists $S_j$ such that $\alpha \leq FE(S_i) + FE(S_j) \leq \beta$.

Lemma 4. For all $i$, $\alpha - D \leq FE(W'_i) \leq \beta + D + \Gamma_{\max}$.
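The slack terms in Lemma 4 can be understood from how the operator $\otimes$ changes the energy: placing $X$ in the middle of $Y$ removes one adjacent pair of $Y$ and creates two new adjacent pairs, so $FE(X \otimes Y) - FE(X) - FE(Y)$ lies between $2\Gamma_{\min} - \Gamma_{\max}$ and $2\Gamma_{\max} - \Gamma_{\min} = D + \Gamma_{\max}$. The Python sketch below checks this numerically; it assumes non-negative integer energies and reuses a made-up table `GAMMA` that is not taken from the paper, and the name `pad` is hypothetical.

```python
from itertools import product

BASES = "ACGT"

# Hypothetical pairwise energies in {1, 2, 3, 4}; NOT real thermodynamic data.
GAMMA = {(a, b): 1 + (BASES.index(a) + 2 * BASES.index(b)) % 4
         for a, b in product(BASES, repeat=2)}


def free_energy(word):
    """Sum of the pairwise energies over all adjacent character pairs."""
    return sum(GAMMA[pair] for pair in zip(word, word[1:]))


def pad(x, y):
    """The operator X (x) Y: place X between the two halves of the even-length string Y."""
    if len(y) % 2:
        raise ValueError("Y must have even length")
    half = len(y) // 2
    return y[:half] + x + y[half:]


def energy_slack(x, y):
    """FE(X (x) Y) - FE(X) - FE(Y): two new junction pairs minus one removed pair."""
    return free_energy(pad(x, y)) - free_energy(x) - free_energy(y)
```

For example, `pad("AC", "GGTT")` gives `"GGACTT"`. When $\Gamma_{\min} \geq 0$, the lower end of the slack range is at least $-D$, which matches the lower bound in Lemma 4.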

Section 4.1 discusses the details for Step 4 of the algorithm. Finally, Section 4.2 
establishes its correctness and time complexity. 

4.1 Computing Strings with Bounded Energies 

In Step 4 of Algorithm FastDWD$_{1,2,3,4,5,6,9}$ we need to produce a set of $n$ DNA strings
$S_1, S_2, \ldots, S_n$, each of a given length $L = m$, such that $A_i \leq FE(S_i) \leq B_i$ for some $A_i$,
$B_i$ such that $B_i - A_i \leq \Delta$. That is, we need to solve the following problem.

Problem 6 (Bounded-Energy Strand Generation).
Input:

1. Integers $A_i$ and $B_i$ for $i = 1$ to $n$ such that

(a) $A_i \geq W_{\min}$,

(b) $B_i \leq W_{\max}$,

(c) $B_i - A_i \leq \Delta$.

2. Length $L$.

Output: Strings $S_1, S_2, \ldots, S_n$ each of length $L$ and respective energy $E_i$ such that
$A_i \leq E_i \leq B_i$.



Algorithm ConstructStrings($\{A_i\}$, $\{B_i\}$, $L$)

1. Let $\Phi \leftarrow$ Build($L$).

2. if $n > \sqrt{L / \log L}$, then $\Psi \leftarrow$ SlowBuild($L$), else $\Psi \leftarrow$ NULL.

3. For each $i = 1$ to $n$, find a nonzero coefficient of $x^{E_i}$ in some polynomial
$f_{L,a,b}(x) \in \Phi$ such that $A_i \leq E_i \leq B_i$.

4. For $i = 1$ to $n$, set $S_i$ = Extract($E_i$, $\Phi$, $\Psi$).

Fig. 5. This algorithm solves the Bounded-Energy Strand Generation Problem
(Problem 6).



Our solution to this problem involves reducing the bulk of the computational
task to the problem of polynomial multiplication. Consider the following polynomial.

Definition 1. For any integer $\ell \geq 1$, let $f_{\ell,a,b}(x)$ be the polynomial $\sum_{z \geq 0} \zeta_z x^z$ where
coefficient $\zeta_z$ is the number of length-$\ell$ strings whose first character is $a$, last character
is $b$, and free energy is $z$.

For $f_\ell(x) = \sum_{a,b \in \Pi} f_{\ell,a,b}(x)$, the coefficient of $x^i$ denotes the number of strings of
length $\ell$ and free energy $i$. As a first step towards our solution, we use a subroutine
BUILD($L$) which computes $\Phi$, the polynomials $f_{L,a,b}(x), f_{\lfloor L/2 \rfloor,a,b}(x), \ldots, f_{1,a,b}(x)$, for
all $a, b \in \Pi$ in $O(L \log L)$ time. The efficient computation of these polynomials relies
on the following recursive property.

Lemma 5. For any integers $\ell_1, \ell_2 \geq 1$,

$$f_{\ell_1+\ell_2,a,b}(x) = \sum_{d_1, d_2 \in \Pi} f_{\ell_1,a,d_1}(x) \cdot f_{\ell_2,d_2,b}(x) \cdot x^{\Gamma_{d_1,d_2}}.$$

The problem of determining the number of strings of length $L$ and free energy $E$ is
considered in [12] and a dynamic programming based $O(L^2)$-time algorithm is provided.
However, by exploiting the recursive property of Lemma 5 and Fast Fourier Transforms
[8] for polynomial multiplication, the subroutine BUILD solves this problem in faster
$O(L \log L)$ time and may be of independent interest.
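The recursion of Lemma 5 can be illustrated with a small Python sketch that builds the polynomials $f_{\ell,a,b}$ by repeated halving. This is a sketch under stated assumptions, not the paper's BUILD routine: it uses naive quadratic polynomial multiplication in place of the FFT, so it does not achieve the $O(L \log L)$ bound, and it uses a made-up integer energy table `GAMMA`. All function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"

# Hypothetical pairwise energies in {1, 2, 3, 4}; NOT real thermodynamic data.
GAMMA = {(a, b): 1 + (BASES.index(a) + 2 * BASES.index(b)) % 4
         for a, b in product(BASES, repeat=2)}


def poly_mul(p, q):
    """Naive polynomial product on coefficient lists (an FFT would replace this)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        if a:
            for j, b in enumerate(q):
                out[i + j] += a * b
    return out


def poly_add(p, q):
    """Coefficient-wise sum of two polynomials of possibly different degree."""
    size = max(len(p), len(q))
    return [(p[i] if i < len(p) else 0) + (q[i] if i < len(q) else 0)
            for i in range(size)]


def base_polys():
    """f_{1,a,b}: a single character has energy 0, so f_{1,a,a} = 1 and f_{1,a,b} = 0 otherwise."""
    return {(a, b): [1] if a == b else [0] for a, b in product(BASES, repeat=2)}


def combine(f1, f2):
    """Lemma 5: join strings counted by f1 and f2, paying GAMMA at the junction."""
    out = {}
    for a, b in product(BASES, repeat=2):
        total = [0]
        for d1, d2 in product(BASES, repeat=2):
            term = poly_mul(f1[(a, d1)], f2[(d2, b)])
            total = poly_add(total, [0] * GAMMA[(d1, d2)] + term)  # times x^GAMMA
        out[(a, b)] = total
    return out


def build(length):
    """Polynomials f_{length,a,b} for all a, b, computed by halving the length."""
    if length == 1:
        return base_polys()
    half = build(length // 2)
    full = combine(half, half)
    if length % 2:
        full = combine(full, base_polys())
    return full
```

The coefficient at index $z$ of `build(L)[(a, b)]` is the number of length-$L$ strings that start with $a$, end with $b$, and have energy $z$ under the assumed table. Only a logarithmic number of `combine` calls is needed, which is what makes a fast polynomial multiplication worthwhile.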

Our algorithm for Problem 6 has two phases, the build phase and the extract phase. 
The build phase constructs a data structure that permits the extract phase to be 
executed quickly. In the extract phase, an extraction routine is run n times to output 
$S_i$ for each $i \in [1, n]$. Since the extraction routine is executed $n$ times and the build
routine only once, the phase that constitutes the bottleneck for our algorithm for 
Problem 6 depends on the values of n and L. We thus provide two forks for the 
algorithm to take, one with a fast build routine and a modestly fast extract routine, 
and the other with a slower build routine but an optimally fast extract routine. In 
particular, if n is sufficiently larger than L, our algorithm for Problem 6 calls a routine 
SlowBuild(L) which improves the runtime of Extract. Otherwise, only a faster BUILD 
function is called in the first phase, leading to a slower Extract routine. The algorithm 
for Problem 6 is given in Figure 5. 

Algorithm ConstructStrings makes use of three subroutines: Build, SlowBuild
and Extract. The procedure Build($L$) computes $\Phi$, a data structure containing for all
$a, b \in \Pi$ and a given $L$, the polynomials $f_{\ell,a,b}(x)$ for $\ell = L, \lfloor L/2 \rfloor, \lfloor L/4 \rfloor, \lfloor L/8 \rfloor, \ldots, 1$. This
permits Extract($E$, $\Phi$, $\Psi$) to obtain a length-$L$ string of energy $E$ in time $O(L \log L)$.



A call to SlowBuild($L$) of time complexity $O(L^{1.5} \log^{0.5} L)$ improves the complexity of
Extract($E$, $\Phi$, $\Psi$) to $O(L)$ by computing a data structure $\Psi$ containing for every non-zero
term $x^i$ in $f_{\ell_1+\ell_2,a,b}$ a corresponding pair of non-zero terms $x^j$ and $x^{i-j-\Gamma_{d_1,d_2}}$ in
$f_{\ell_1,a,d_1}$ and $f_{\ell_2,d_2,b}$ respectively. This yields the following theorem.

Theorem 5. Algorithm ConstructStrings($\{A_i\}$, $\{B_i\}$, $L$) solves Problem 6 in time
$O(\min\{nL \log L,\; L^{1.5} \log^{0.5} L + nL\})$.

4.2 Putting it all together for $DWD_{1,2,3,4,5,6,9}$

Theorem 6. Algorithm FastDWD$_{1,2,3,4,5,6,9}$ produces a set of $n$ DNA words of optimal
length $\Theta(k + \log n)$ in time $O(\min\{n\ell \log \ell,\; \ell^{1.5} \log^{0.5} \ell + n\ell\})$ satisfying the constraints
$C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_4(k_4)$, $C_5(k_5)$, $C_6(k_6)$, and $C_9(4D + \Gamma_{\max})$ with probability
of failure $o(1/(n + 4^k))$, where $k = \max\{k_1, k_2, k_3, k_4, k_5, k_6\}$.

Proof. From Theorem 2 we know that $\mathcal{W}$ satisfies constraints $C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$,
$C_4(k_4)$, $C_5(k_5)$, and $C_6(k_6)$ with probability of failure $o(1/(n + 4^k))$. If $W_{\max} - W_{\min} \leq
3D$, then FastDWD$_{1,2,3,4,5,6,9}$ outputs $\mathcal{W}$, which satisfies $C_9(3D)$ and hence also satisfies
$C_9(4D + \Gamma_{\max})$. Otherwise, it is easy to verify that since $\mathcal{W}$ satisfies these six constraints,
so does $\mathcal{W}'$. From Lemma 3 we know that there always exists a string $S_j$ as required
in Step 4 of FastDWD$_{1,2,3,4,5,6,9}$. Further, Lemma 4 shows that $\mathcal{W}'$ satisfies $C_9(\Delta +
2D + \Gamma_{\max})$, which by Lemma 1 implies $C_9(4D + \Gamma_{\max})$. Therefore, the output satisfies
constraints $C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_4(k_4)$, $C_5(k_5)$,
$C_6(k_6)$, and $C_9(4D + \Gamma_{\max})$ with the stated failure probability.

The length of any word $W' \in \mathcal{W}'$ is at most $3\ell$ where $\ell = \Theta(k + \log n)$, which is
optimal from Theorem 1.

Generating $\mathcal{W}$ takes $O(n \cdot k + n \cdot \log n)$ time. The bulk of the time complexity for the
algorithm comes from Step 4, which is analyzed in Section 4.1 to get $O(\min\{nL \log L,\;
L^{1.5} \log^{0.5} L + nL\})$ (see Theorem 5) where $L = O(\ell)$. □

5 Future Work 

A number of problems related to this work remain open. It is still unknown how to 
generate words of optimal length that simultaneously satisfy the free energy constraint 
and the consecutive bases constraint. We also have not provided a method for combining 
the consecutive bases constraint with any of the shifting constraints. 

Another open research area is the verification problem of testing whether or not 
a set of words satisfies a given set of constraints. This problem is important because
our algorithms only provide a high-probability assurance of success. While verification 
can clearly be done in polynomial time for all of our constraints, the naive method of 
verification has a longer runtime than our algorithms for constructing the sets. Finding 
faster, non-trivial verification algorithms is an open problem. 

A third direction for future work involves considering a generalized form of the 
basic Hamming constraint. There are applications in which it is desirable to design 
sets of words such that some distinct pairs bind with one another, while others do not 
[2, 14]. In this scenario, we can formulate a word design problem that takes as input 
a matrix of pairwise requirements for Hamming distances. Determining when such a
problem is solvable, and how to solve it optimally when it is, are open problems.